Abstract: Performing signal processing tasks on compressive
measurements of data has received great attention in recent years.
In this paper, we extend previous work on compressive dictionary
learning by showing that more general random projections may
be used, including sparse ones. More precisely, we examine
compressive K-means clustering as a special case of compressive
dictionary learning and give theoretical guarantees for its
performance for a very general class of random projections. We then
propose a memory and computation efficient dictionary learning
algorithm, specifically designed for analyzing large volumes of
high-dimensional data, which learns the dictionary from very
sparse random projections. Experimental results demonstrate
that our approach allows for reduction of computational
complexity and memory/data access, with controllable loss in accuracy.

I. Introduction 

There are several ways to represent low-dimensional structure
of high-dimensional data, the best known being principal
component analysis (PCA). However, PCA is based on a linear
subspace model that is generally not capable of capturing the
geometric structure of real-world datasets [1].

The sparse signal model is a nonlinear generalization of the
linear subspace model that has been used in various signal
and image processing tasks [2]-[4], as well as compressive
sensing [5]. This model assumes that each data sample can be
represented as a linear combination of a few elements (atoms)
from a dictionary. Data-adaptive dictionary learning can lead
to a much more compact representation than predefined
dictionaries such as wavelets, and thus a central problem is finding
a good data-adaptive dictionary.

Dictionary learning algorithms such as the method of optimal
directions (MOD) [6] and the K-SVD algorithm [7] aim
to learn a dictionary by minimizing the representation error
of data in an iterative procedure involving two steps of sparse
coding and dictionary update. The latter often requires ready
access to the entire data available at a central processing unit.

Due to increasing sizes of datasets, not only do algorithms 
take longer to run, but it may not even be feasible or practical 
to acquire and/or hold every data entry. In applications such as 
distributed databases, where data is typically distributed over 
an interconnected set of distributed sites [8], it is important to
avoid communicating the entire data. 

A promising approach to address these issues is to take a
compressive sensing approach, where we only have access to
compressive measurements of data. In fact, performing signal
processing and data mining tasks on compressive versions
of the data has been an important topic in the recent
literature. For example, in [9], certain inference problems such
as detection and estimation within the compressed domain
have been studied. Several lines of work consider recovery
of principal components [10]-[13], spectral features [14], and
change detection [15] from compressive measurements.

In this paper, we focus on the problem of dictionary learning
based on compressive measurements. Our contributions are
twofold. First, we show the connection between dictionary
learning in the compressed domain and K-means clustering.
Most standard dictionary learning algorithms are indeed
generalizations of the K-means clustering algorithm [16], where
the reference to K-means is a common approach to analyze 
the performance of these algorithms 0, ( 13 . This paper 
takes initial steps towards providing theoretical guarantees for 
recovery of the true underlying dictionary from compressive 
measurements. Moreover, our analysis applies to compressive 
measurements obtained by a general class of random matrices 
consisting of i.i.d. zero-mean entries and finite first four 
moments. 

Second, we extend the prior work in [18], where compressive
dictionary learning for random Gaussian matrices is considered.
In particular, we propose a memory and computation
efficient dictionary learning algorithm applicable to modern
data settings. To do this, we learn a dictionary from very
sparse random projections, i.e. projection of the data onto
a few very sparse random vectors with Bernoulli-generated
nonzero entries. These sparse random projections have been
applied in many large-scale applications such as compressive
sensing and object tracking [19], [20] and to efficient learning
of principal components in the large-scale data setting [13]. To
further improve efficiency of our approach, we show how to
share the same random matrix across blocks of data samples.


II. Prior Work on Compressive Dictionary 
Learning 

Several attempts have been made to address the problem of
dictionary learning from compressive measurements. In three
roughly contemporary papers, [21], [22], and our work [18],
three similar algorithms were presented to learn a dictionary
based on compressive measurements. Each was inspired by the
well-known K-SVD algorithm and closely followed its structure,
except that each aimed to minimize the representation
error of the compressive measurements instead of that of the
original signals. The exact steps of each algorithm have minor
differences, but take a similar overall form.

However, none of these works explicitly aimed at designing
the compressive measurements (sketches) to promote the
computational efficiency of the resulting compressive K-SVD, so
that it would be maximally practical for dictionary learning
on large-scale data. Moreover, none of these works gave
theoretical performance analysis for such computationally
efficient sketches.

In this paper, we extend the previous line of work on
compressive dictionary learning by analyzing the scheme under
assumptions that make it memory and computation efficient.
The key to the efficiency of the new scheme is in considering
a wider and more general class of random projection matrices
for the sketches, including some very sparse ones. We further
introduce an initial analysis of the theoretical performance
of compressive dictionary learning under these more general
random projections.

In this section, we review the general dictionary learning
problem and the compressive K-SVD (CK-SVD) algorithm
that was introduced in [18] for the case of random Gaussian
matrices. (We note that the approaches of [21] and [22] are
similar.) Given a set of $n$ training signals
$\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n]$ in $\mathbb{R}^p$,
the dictionary learning problem is to find a dictionary
$\mathbf{D} \in \mathbb{R}^{p \times K}$ that leads to the best
representation under a strict sparsity constraint for each member
in the set, i.e., minimizing

$$
\min_{\mathbf{D}, \mathbf{C}} \sum_{i=1}^{n} \|\mathbf{x}_i - \mathbf{D}\mathbf{c}_i\|_2^2
\quad \text{s.t.} \quad \forall i,\ \|\mathbf{c}_i\|_0 \le T \tag{1}
$$

where $\mathbf{C} = [\mathbf{c}_1, \ldots, \mathbf{c}_n]$ is the
coefficient matrix and the $\ell_0$ pseudo-norm $\|\mathbf{c}_i\|_0$
counts the number of nonzero entries of the coefficient vector
$\mathbf{c}_i \in \mathbb{R}^K$. Moreover, the columns of the
dictionary $\mathbf{D} = [\mathbf{d}_1, \ldots, \mathbf{d}_K]$ are
typically assumed to have unit $\ell_2$-norm. Problem (1) is generally
intractable, so we look for approximate solutions (e.g., via K-SVD [7]).

We then consider compressed measurements (sketches),
where each measurement is obtained by taking inner products
of the data sample $\mathbf{x}_i \in \mathbb{R}^p$ with the columns
of a matrix $\mathbf{R}_i$, i.e., $\mathbf{y}_i = \mathbf{R}_i^T \mathbf{x}_i$
with $\{\mathbf{R}_i\}_{i=1}^{n} \in \mathbb{R}^{p \times m}$, $m < p$, and
$\mathbf{Y} = [\mathbf{y}_1, \ldots, \mathbf{y}_n] \in \mathbb{R}^{m \times n}$.
In [18], the entries of $\mathbf{R}_i$ are i.i.d. from a zero-mean
Gaussian distribution, which is an assumption we drop in the current paper.
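As an illustration of the measurement model, the following minimal NumPy sketch forms $\mathbf{y}_i = \mathbf{R}_i^T \mathbf{x}_i$ with one distinct matrix per sample. It assumes dense Gaussian entries, as in [18]; the function name `sketch`, the toy data, and the dimensions are hypothetical choices made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
p, m, n = 50, 10, 200              # ambient dimension, sketch size (m < p), samples

X = rng.standard_normal((p, n))    # toy data: columns play the role of x_i


def sketch(X, m, rng):
    """Return Y (m x n) and the per-sample matrices R_i (each p x m)."""
    p, n = X.shape
    # Assumption: i.i.d. N(0, 1) entries; any zero-mean distribution with
    # finite fourth moment could be substituted here.
    Rs = [rng.standard_normal((p, m)) for _ in range(n)]
    Y = np.column_stack([R.T @ x for R, x in zip(Rs, X.T)])
    return Y, Rs


Y, Rs = sketch(X, m, rng)          # Y[:, i] equals Rs[i].T @ X[:, i]
```

Only `Y` and the matrices `Rs` (or the seed that generates them) are needed by the learning algorithm; `X` itself is never revisited.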

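Given access only to the compressed measurements $\mathbf{y}_i$
and not $\mathbf{x}_i$, we attempt to solve the following compressive
dictionary learning problem:

$$
\min_{\mathbf{D} \in \mathbb{R}^{p \times K},\ \mathbf{C} \in \mathbb{R}^{K \times n}}
\sum_{i=1}^{n} \|\mathbf{y}_i - \mathbf{R}_i^T \mathbf{D} \mathbf{c}_i\|_2^2
\quad \text{s.t.} \quad
\forall i,\ \|\mathbf{c}_i\|_0 \le T, \qquad
\forall k,\ \|\mathbf{d}_k\|_2 = 1 \tag{2}
$$

In the CK-SVD algorithm, the objective function in (2)
is minimized in a simple iterative approach that alternates
between sparse coding and dictionary update steps.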


A. Sparse Coding 

In the sparse coding step, the penalty term in (2) is minimized
with respect to the coefficient matrix $\mathbf{C}$, for a fixed
$\mathbf{D}$, under the strict sparsity constraint. This can be written as

$$
\min_{\mathbf{C}} \sum_{i=1}^{n} \|\mathbf{y}_i - \boldsymbol{\Psi}_i \mathbf{c}_i\|_2^2
\quad \text{s.t.} \quad \forall i,\ \|\mathbf{c}_i\|_0 \le T \tag{3}
$$

where $\boldsymbol{\Psi}_i := \mathbf{R}_i^T \mathbf{D} \in \mathbb{R}^{m \times K}$
is a fixed equivalent dictionary for representation of $\mathbf{y}_i$.
This optimization problem can be considered as $n$ distinct optimization
problems, one for each compressive measurement. We can then use a variety
of algorithms, such as OMP, to find the approximate solution $\mathbf{c}_i$ [23].
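To make the per-measurement problem concrete, a textbook orthogonal matching pursuit routine applied to the equivalent dictionary $\mathbf{R}_i^T \mathbf{D}$ might look as follows. This is a minimal sketch rather than the implementation used in the experiments; the function name `omp` and the toy dimensions are assumptions of this example.

```python
import numpy as np


def omp(Psi, y, T):
    """Greedy T-sparse approximation of y using the columns of Psi (m x K)."""
    K = Psi.shape[1]
    norms = np.linalg.norm(Psi, axis=0)
    norms[norms == 0] = 1.0                  # guard against zero columns
    residual, support = y.copy(), []
    coef = np.zeros(0)
    for _ in range(T):
        corr = np.abs(Psi.T @ residual) / norms
        if support:
            corr[support] = -np.inf          # never select the same atom twice
        support.append(int(np.argmax(corr)))
        # Least-squares refit on the current support (the "orthogonal" step).
        coef, *_ = np.linalg.lstsq(Psi[:, support], y, rcond=None)
        residual = y - Psi[:, support] @ coef
    c = np.zeros(K)
    c[support] = coef
    return c


# Toy usage: one measurement of a 2-sparse signal.
rng = np.random.default_rng(1)
p, m, K, T = 40, 20, 8, 2
D = rng.standard_normal((p, K))
D /= np.linalg.norm(D, axis=0)
R = rng.standard_normal((p, m))
c_true = np.zeros(K)
c_true[[1, 5]] = [3.0, -2.0]
y = R.T @ (D @ c_true)
c_hat = omp(R.T @ D, y, T)                   # sparse code for this measurement
```

Because the equivalent dictionary differs for each measurement when the $\mathbf{R}_i$ are distinct, the routine is called once per sample with its own `Psi`.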


B. Dictionary Update 

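The approach is to update the $k$th dictionary atom $\mathbf{d}_k$ and its
corresponding coefficients while holding $\mathbf{d}_j$ fixed for $j \neq k$,
and then repeat for $k + 1$, until $k = K$. The penalty term in
(2) can be written as

$$
\sum_{i=1}^{n} \Big\| \mathbf{y}_i - \mathbf{R}_i^T \sum_{j=1}^{K} c_{i,j} \mathbf{d}_j \Big\|_2^2
= \sum_{i \in \mathcal{I}_k} \big\| \mathbf{e}_{i,k} - c_{i,k} \mathbf{R}_i^T \mathbf{d}_k \big\|_2^2
+ \sum_{i \notin \mathcal{I}_k} \|\mathbf{e}_{i,k}\|_2^2 \tag{4}
$$

where $c_{i,k}$ is the $k$th element of $\mathbf{c}_i \in \mathbb{R}^K$,
$\mathcal{I}_k$ is a set of indices of compressive measurements for which
$c_{i,k} \neq 0$, and
$\mathbf{e}_{i,k} = \mathbf{y}_i - \mathbf{R}_i^T \sum_{j \neq k} c_{i,j} \mathbf{d}_j \in \mathbb{R}^m$
is the representation error for $\mathbf{y}_i$ when the $k$th dictionary atom
is removed. The penalty term in (4) is a quadratic function of $\mathbf{d}_k$,
and the minimizer is obtained by setting the derivative with respect to
$\mathbf{d}_k$ equal to zero. Hence,

$$
\mathbf{G}_k \mathbf{d}_k = \mathbf{b}_k \tag{5}
$$

where $\mathbf{G}_k \triangleq \sum_{i \in \mathcal{I}_k} c_{i,k}^2 \mathbf{R}_i \mathbf{R}_i^T$
and $\mathbf{b}_k \triangleq \sum_{i \in \mathcal{I}_k} c_{i,k} \mathbf{R}_i \mathbf{e}_{i,k}$.
Therefore, we get the closed-form solution
$\mathbf{d}_k = \mathbf{G}_k^{\dagger} \mathbf{b}_k$,
where $\mathbf{G}_k^{\dagger}$ denotes the Moore-Penrose pseudo-inverse of $\mathbf{G}_k$.
Once given the new $\mathbf{d}_k$ (normalized to have unit $\ell_2$-norm),
the optimal $c_{i,k}$ for each $i \in \mathcal{I}_k$ is given by least squares as
$c_{i,k} = \langle \mathbf{e}_{i,k}, \mathbf{R}_i^T \mathbf{d}_k \rangle / \|\mathbf{R}_i^T \mathbf{d}_k\|_2^2$.
By design, the support of the coefficient
matrix $\mathbf{C}$ is preserved, just as in the K-SVD algorithm.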
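The update of a single atom can be sketched in NumPy as below: accumulate $\mathbf{G}_k$ and $\mathbf{b}_k$ over the samples that use atom $k$, solve with the pseudo-inverse, normalize, and refit the coefficients by least squares. This is an illustrative sketch under toy assumptions (noiseless data, Gaussian $\mathbf{R}_i$, a perturbed initial dictionary); the names `update_atom` and `objective` are hypothetical.

```python
import numpy as np


def update_atom(k, D, C, Rs, Y):
    """Update atom k of D (p x K) and row k of C (K x n) in place."""
    idx = np.flatnonzero(C[k])                 # I_k: samples that use atom k
    if idx.size == 0:
        return
    p = D.shape[0]
    G, b, errs = np.zeros((p, p)), np.zeros(p), {}
    for i in idx:
        # e_{i,k}: residual of y_i once atom k's contribution is removed.
        e = Y[:, i] - Rs[i].T @ (D @ C[:, i] - D[:, k] * C[k, i])
        errs[i] = e
        G += C[k, i] ** 2 * Rs[i] @ Rs[i].T
        b += C[k, i] * Rs[i] @ e
    d = np.linalg.pinv(G) @ b                  # closed-form minimizer
    nrm = np.linalg.norm(d)
    if nrm == 0:
        return
    D[:, k] = d / nrm                          # unit l2-norm atom
    for i in idx:                              # least-squares coefficient refit
        z = Rs[i].T @ D[:, k]
        C[k, i] = (errs[i] @ z) / (z @ z)


# Toy usage: one update should not increase the compressive objective.
rng = np.random.default_rng(2)
p, m, K, n = 30, 10, 4, 60
D_true = rng.standard_normal((p, K))
D_true /= np.linalg.norm(D_true, axis=0)
C = np.zeros((K, n))
C[rng.integers(K, size=n), np.arange(n)] = rng.standard_normal(n)
Rs = [rng.standard_normal((p, m)) for _ in range(n)]
Y = np.column_stack([Rs[i].T @ (D_true @ C[:, i]) for i in range(n)])
D = D_true + 0.1 * rng.standard_normal((p, K))
D /= np.linalg.norm(D, axis=0)


def objective(D, C):
    return sum(np.linalg.norm(Y[:, i] - Rs[i].T @ (D @ C[:, i])) ** 2
               for i in range(n))


before = objective(D, C)
update_atom(0, D, C, Rs, Y)
after = objective(D, C)
```

Forming the $p \times p$ matrix explicitly is acceptable at this toy scale; for large $p$ one would more likely solve the linear system iteratively.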


III. Initial Theoretical Analysis: K-Means Case 

In this section, we provide an initial theoretical analysis of 
the performance of the CK-SVD algorithm by restricting our 
attention to a special case of dictionary learning: K-means 
clustering. In this special case, we can provide theoretical 
guarantees on the performance of CK-SVD at every step, in 
relation to the steps of K-means. Moreover, these guarantees 
will hold for a very general class of projection matrices 
including very sparse random projections. 











We consider a statistical framework to establish the connection
between CK-SVD and K-means. K-means clustering can
be viewed as a special case of dictionary learning in which
each data sample is allowed to use one dictionary atom (cluster
center), i.e. $T = 1$, and the corresponding coefficient is set to
be 1. Therefore, we consider the following generative model

$$
\mathbf{x}_i = \mathbf{d}_k + \boldsymbol{\epsilon}_i, \quad i \in \mathcal{I}_k \tag{6}
$$

where $\mathbf{d}_k$ is the center of the $k$th cluster, and
$\{\boldsymbol{\epsilon}_i\}_{i=1}^{n} \in \mathbb{R}^p$
represent residuals in signal approximation, drawn i.i.d. from
$\mathcal{N}(\mathbf{0}, \frac{\sigma^2}{p} \mathbf{I}_{p \times p})$,
so that the approximation error is
$\mathbb{E}[\|\boldsymbol{\epsilon}_i\|_2^2] = \sigma^2$.
The set $\{\mathcal{I}_k\}_{k=1}^{K}$ is an arbitrary partition of
$\{1, 2, \ldots, n\}$, with the condition that $|\mathcal{I}_k| \to \infty$
as $n \to \infty$.
The random matrices $\mathbf{R}_i$ are assumed to satisfy the following:

Assumption 1. Each entry of the random matrices
$\{\mathbf{R}_i\}_{i=1}^{n} \in \mathbb{R}^{p \times m}$ is drawn i.i.d.
from a general class of zero-mean distributions with finite first four
moments $\{\mu_k\}_{k=1}^{4}$.


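We will see that the distribution's kurtosis is a key factor in
our results. The kurtosis, defined as $\kappa = \mu_4 / \mu_2^2 - 3$,
is a measure of peakedness and heaviness of tail for a distribution.

We now show how, in this special case of K-means, CK-SVD
would update the cluster centers. As mentioned before,
in this case, we should set $T = 1$ and the corresponding
coefficients are set to be 1. This means that for all $i \in \mathcal{I}_k$,
we have $c_{i,k} = 1$ and $c_{i,j} = 0$ for $j \neq k$, which leads to
$\mathbf{e}_{i,k} = \mathbf{y}_i$, $\forall i \in \mathcal{I}_k$. Then, the
update formula for the $k$th dictionary atom of CK-SVD given in (5) reduces to

$$
\Big( \sum_{i \in \mathcal{I}_k} \mathbf{R}_i \mathbf{R}_i^T \Big) \mathbf{d}_k
= \sum_{i \in \mathcal{I}_k} \mathbf{R}_i \mathbf{y}_i \tag{7}
$$

Hence, similar to K-means, the process of updating the $K$ dictionary
atoms becomes independent of each other. We can rewrite
(7) as $\mathbf{H}_k \mathbf{d}_k = \mathbf{f}_k$, where

$$
\mathbf{H}_k \triangleq \frac{1}{m \mu_2 |\mathcal{I}_k|} \sum_{i \in \mathcal{I}_k} \mathbf{R}_i \mathbf{R}_i^T,
\qquad
\mathbf{f}_k \triangleq \frac{1}{m \mu_2 |\mathcal{I}_k|} \sum_{i \in \mathcal{I}_k} \mathbf{R}_i \mathbf{y}_i \tag{8}
$$

In [13], it is shown that
$\mathbb{E}[\mathbf{R}_i \mathbf{R}_i^T] = m \mu_2 \mathbf{I}_{p \times p}$.
Thus, we see

$$
\begin{aligned}
\mathbb{E}[\mathbf{R}_i \mathbf{y}_i]
&= \mathbb{E}[\mathbf{R}_i \mathbf{R}_i^T \mathbf{x}_i] \\
&= \mathbb{E}[\mathbf{R}_i \mathbf{R}_i^T \mathbf{d}_k] + \mathbb{E}[\mathbf{R}_i \mathbf{R}_i^T \boldsymbol{\epsilon}_i] \\
&= \mathbb{E}[\mathbf{R}_i \mathbf{R}_i^T] \mathbf{d}_k + \mathbb{E}[\mathbf{R}_i \mathbf{R}_i^T] \, \mathbb{E}[\boldsymbol{\epsilon}_i]
= m \mu_2 \mathbf{d}_k
\end{aligned} \tag{9}
$$

Therefore, when the number of samples is sufficiently large,
using the law of large numbers, $\mathbf{H}_k$ and $\mathbf{f}_k$ converge to
$\frac{1}{m \mu_2} \mathbb{E}[\mathbf{R}_i \mathbf{R}_i^T] = \mathbf{I}_{p \times p}$ and
$\frac{1}{m \mu_2} \mathbb{E}[\mathbf{R}_i \mathbf{y}_i] = \mathbf{d}_k$. Hence, the
updated dictionary atom in our CK-SVD is the original center
of the cluster, i.e. $\hat{\mathbf{d}}_k = \mathbf{d}_k$ with $\hat{\mathbf{d}}_k$
denoting the updated atom, exactly as in K-means. Note that in
this case even one measurement per signal, $m = 1$, is sufficient.

The following theorem characterizes convergence rates for
$\mathbf{H}_k$ and $\mathbf{f}_k$ based on various parameters such as the number
of samples and the choice of random matrices.

Theorem 2. Assume Assumption 1. Then, $\mathbf{H}_k$ defined in (8)
converges to the identity matrix $\mathbf{I}_{p \times p}$ and for any
$\eta > 0$, we have

$$
P\left( \frac{\|\mathbf{H}_k - \mathbf{I}_{p \times p}\|_F}{\|\mathbf{I}_{p \times p}\|_F} \le \eta \right) \ge 1 - P_0 \tag{10}
$$

where

$$
P_0 = \frac{1}{m |\mathcal{I}_k| \eta^2} (\kappa + 1 + p). \tag{11}
$$

Also, consider the probabilistic model given in (6) and compressive
measurements $\mathbf{y}_i = \mathbf{R}_i^T \mathbf{x}_i$. Then,
$\mathbf{f}_k$ defined in (8) converges to the center of original data and
for any $\eta > 0$, we have

$$
P\left( \frac{\|\mathbf{f}_k - \mathbf{d}_k\|_2}{\|\mathbf{d}_k\|_2} \le \eta \right) \ge 1 - P_1 \tag{12}
$$

where

$$
P_1 = P_0 + \frac{1}{\mathrm{SNR}} \left( P_0 + \frac{1}{|\mathcal{I}_k| \eta^2} \right) \tag{13}
$$

and the signal-to-noise ratio is defined as
$\mathrm{SNR} \triangleq \|\mathbf{d}_k\|_2^2 / \sigma^2$.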
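The convergence of $\mathbf{H}_k$ and $\mathbf{f}_k$ can be checked numerically. The sketch below is a small Monte Carlo illustration under assumptions that are ours and not taken from the experiments: Gaussian entries with $\mu_2 = 1$, a single cluster with a unit-norm center, and toy dimensions. The function name `hk_fk` is hypothetical.

```python
import numpy as np


def hk_fk(d, n_k, m, sigma, rng):
    """Form H_k and f_k from n_k Gaussian sketches of one cluster."""
    p = d.size
    H, f = np.zeros((p, p)), np.zeros(p)
    for _ in range(n_k):
        R = rng.standard_normal((p, m))                       # mu_2 = 1
        x = d + rng.normal(scale=sigma / np.sqrt(p), size=p)  # center plus noise
        y = R.T @ x                                           # compressive measurement
        H += R @ R.T
        f += R @ y
    return H / (m * n_k), f / (m * n_k)


rng = np.random.default_rng(3)
p, m, sigma = 20, 5, 0.1
d = rng.standard_normal(p)
d /= np.linalg.norm(d)                                        # unit-norm cluster center

errors = {}
for n_k in (50, 5000):
    H, f = hk_fk(d, n_k, m, sigma, rng)
    err_H = np.linalg.norm(H - np.eye(p)) / np.sqrt(p)        # ||H - I||_F / ||I||_F
    err_f = np.linalg.norm(f - d) / np.linalg.norm(d)
    errors[n_k] = (err_H, err_f)
```

With a hundred times more samples, both relative errors should shrink by roughly a factor of ten, in line with the $1/|\mathcal{I}_k|$ dependence of the bounds on the squared error.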

We see that for a fixed error bound $\eta$, as $|\mathcal{I}_k|$ increases,
the error probability $P_0$ decreases at rate $1/|\mathcal{I}_k|$. Therefore,
for any fixed $\eta > 0$, the error probability $P_0$ goes to zero
as $|\mathcal{I}_k| \to \infty$. Note that the shape of the distribution,
specified by the kurtosis, is an important factor. For random matrices
with heavy-tailed entries, the error probability $P_0$ increases.
However, $P_0$ gives us an explicit tradeoff between $|\mathcal{I}_k|$, the
measurement ratio, and anisotropy in the distribution. For
example, the increase in kurtosis can be compensated by
increasing $|\mathcal{I}_k|$. The convergence rate analysis for $\mathbf{f}_k$
follows the same path. We further note that $P_1$ is a decreasing function
of the signal-to-noise ratio and, as SNR increases, $P_1$ gets
closer to $P_0$; in the limit $\mathrm{SNR} \to \infty$, we have $P_1 \approx P_0$.
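The tradeoff between kurtosis and the number of samples can be made tangible by evaluating the two bounds directly. The helper names below (`p0_bound`, `p1_bound`, `samples_needed`) and the numerical values are illustrative assumptions, written from the expressions $P_0 = (\kappa + 1 + p) / (m |\mathcal{I}_k| \eta^2)$ and $P_1 = P_0 + (P_0 + 1/(|\mathcal{I}_k| \eta^2)) / \mathrm{SNR}$.

```python
import math


def p0_bound(kappa, p, m, n_k, eta):
    """Bound on the probability that the relative error of H_k exceeds eta."""
    return (kappa + 1 + p) / (m * n_k * eta**2)


def p1_bound(kappa, p, m, n_k, eta, snr):
    """Bound for f_k, with snr = ||d_k||^2 / sigma^2."""
    base = p0_bound(kappa, p, m, n_k, eta)
    return base + (base + 1.0 / (n_k * eta**2)) / snr


def samples_needed(kappa, p, m, eta, target):
    """Smallest cluster size |I_k| for which p0_bound is at most target."""
    return math.ceil((kappa + 1 + p) / (m * eta**2 * target))


p, m, eta = 100, 30, 0.1
n_gauss = samples_needed(0, p, m, eta, target=0.5)     # kurtosis 0
n_sparse = samples_needed(17, p, m, eta, target=0.5)   # kurtosis 17
```

For these values the heavier-tailed case needs about $118/101 \approx 1.17$ times as many samples to reach the same bound, which is the compensation effect described above.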

Let's consider an example to gain intuition on the choice
of random matrices. We are interested in comparing the
dense random Gaussian matrices with very sparse random
matrices, where each entry is drawn from $\{-1, 0, +1\}$ with
probabilities $\{\frac{1}{2s}, 1 - \frac{1}{s}, \frac{1}{2s}\}$ for $s \ge 1$
(we refer to this distribution as a sparse-Bernoulli distribution with
parameter $s$). $\{\mathbf{R}_i\}_{i=1}^{n} \in \mathbb{R}^{p \times m}$,
$p = 100$ and $m/p = 0.3$, are generated with i.i.d. entries both for
Gaussian and the sparse-Bernoulli distribution. In Fig. 1, we see that as
$|\mathcal{I}_k|$ increases, $\mathbf{H}_k$ gets closer to the identity
matrix $\mathbf{I}_{p \times p}$. Also, for a fixed $|\mathcal{I}_k|$, as the
sparsity of random matrices increases, the kurtosis $\kappa = s - 3$
increases. Therefore, based on Theorem 2, we expect that
the distance between $\mathbf{H}_k$ and $\mathbf{I}_{p \times p}$ increases.
Note that for Gaussian and the sparse-Bernoulli with $s = 3$, we have $\kappa = 0$.
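A sparse-Bernoulli matrix is straightforward to generate, and the relation $\kappa = s - 3$ can be checked empirically. The sketch below is illustrative only; the function names `sparse_bernoulli` and `empirical_kurtosis` and the sample sizes are assumptions of this example.

```python
import numpy as np


def sparse_bernoulli(p, m, s, rng):
    """p x m matrix with entries -1, 0, +1 w.p. 1/(2s), 1 - 1/s, 1/(2s)."""
    return rng.choice([-1.0, 0.0, 1.0], size=(p, m),
                      p=[0.5 / s, 1.0 - 1.0 / s, 0.5 / s])


def empirical_kurtosis(samples):
    """Estimate mu_4 / mu_2^2 - 3 for zero-mean samples."""
    mu2 = np.mean(samples**2)
    mu4 = np.mean(samples**4)
    return mu4 / mu2**2 - 3.0


rng = np.random.default_rng(0)
kappa_hat = {s: empirical_kurtosis(sparse_bernoulli(1000, 1000, s, rng))
             for s in (3, 10)}                       # expected near 0 and 7
kappa_gauss = empirical_kurtosis(rng.standard_normal((1000, 1000)))

# Average number of nonzeros per column, expected near p / s = 100.
nnz_per_col = np.count_nonzero(sparse_bernoulli(1000, 50, 10, rng), axis=0).mean()
```

Since $\mu_2 = \mu_4 = 1/s$ for this distribution, the kurtosis is $s - 3$, so $s = 3$ matches the Gaussian case in the bounds while larger $s$ trades accuracy for sparsity.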

As a final note, our theoretical analysis gives us valuable
insight about the number of distinct random matrices required.
Based on Theorem 2, there is an inherent tradeoff between the
accuracy and the number of distinct random matrices used.
For example, if we only use one random matrix, we are not
able to recover the true dictionary, as observed in [24]. Also,
increasing the number of distinct random matrices improves
the accuracy, as mentioned in [21]. Hence, we can reduce the
number of distinct random matrices in large-scale problems
where $n = O(p)$ with controlled loss in accuracy.

Fig. 1. Closeness of $\mathbf{H}_k$ to $\mathbf{I}_{p \times p}$, defined as
$\|\mathbf{H}_k - \mathbf{I}_{p \times p}\|_F / \|\mathbf{I}_{p \times p}\|_F$.
$\{\mathbf{R}_i\}_{i=1}^{n} \in \mathbb{R}^{p \times m}$, $p = 100$ and
$m/p = 0.3$, are generated with i.i.d. entries both for Gaussian and the
sparse-Bernoulli distribution. We see that as $|\mathcal{I}_k|$
increases, $\mathbf{H}_k$ gets closer to $\mathbf{I}_{p \times p}$. Also, for
fixed $|\mathcal{I}_k|$, as the sparsity of random matrices increases, the
kurtosis $\kappa = s - 3$ increases and consequently the distance between
$\mathbf{H}_k$ and $\mathbf{I}_{p \times p}$ increases. For Gaussian and the
sparse-Bernoulli with $s = 3$, we have $\kappa = 0$. We also plot the
theoretical bound $\eta$ with $P_0 = 0.5$ for the Gaussian case.

IV. Memory and Computation Efficient Dictionary 
Learning 

Now, we return our attention to general dictionary learning.
Inspired by the generality of the projection matrices in
Theorem 2, we sketch using very sparse random matrices, and
furthermore reduce the number of distinct random matrices to
increase the efficiency of our approach.

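Assume that the original data samples are divided into $L$
blocks $\mathbf{X} = [\mathbf{X}^{(1)}, \ldots, \mathbf{X}^{(L)}]$, where
$\mathbf{X}^{(l)}$ represents the $l$th block. Let
$\mathbf{R}_l \in \mathbb{R}^{p \times m}$, $m < p$, represent the random matrix
used for the $l$th block. Then, we have

$$
\mathbf{Y}^{(l)} = \mathbf{R}_l^T \mathbf{X}^{(l)}, \quad 1 \le l \le L \tag{14}
$$

where $\mathbf{Y}^{(l)}$ is the sketch of $\mathbf{X}^{(l)}$. Each entry of
$\{\mathbf{R}_l\}_{l=1}^{L}$ is distributed on $\{-1, 0, +1\}$ with probabilities
$\{\frac{1}{2s}, 1 - \frac{1}{s}, \frac{1}{2s}\}$.
Here, the parameter $s$ controls the sparsity of random matrices
such that each column of $\{\mathbf{R}_l\}_{l=1}^{L}$ has $p/s$ nonzero entries, on
average. We are specifically interested in choosing $m$ and $s$
such that the compression factor $\gamma = m/s < 1$. Thus, the cost
to acquire each compressive measurement is $O(\gamma p)$, $\gamma < 1$,
vs. the cost for collecting every data entry, $O(p)$.

Similarly, we aim to minimize the representation error as

$$
\min_{\mathbf{D} \in \mathbb{R}^{p \times K},\ \mathbf{C} \in \mathbb{R}^{K \times n}}
\sum_{l=1}^{L} \big\| \mathbf{Y}^{(l)} - \mathbf{R}_l^T \mathbf{D} \mathbf{C}^{(l)} \big\|_F^2
\quad \text{s.t.} \quad \forall i,\ \|\mathbf{c}_i^{(l)}\|_0 \le T \tag{15}
$$

where $\mathbf{c}_i^{(l)}$ represents the $i$th column of $\mathbf{C}^{(l)}$, the
$l$th block of the coefficient matrix. As before, the penalty term in (15) is
minimized in a simple iterative approach involving two steps.
The first step, sparse coding, is the same as the CK-SVD
algorithm previously described, except we can take advantage
of the block structure and use Batch-OMP [25] in each block,
which is significantly faster than OMP for each $\mathbf{c}_i$ separately.

A. Dictionary Update

The goal is to update the $k$th dictionary atom $\mathbf{d}_k$ for
$k = 1, \ldots, K$, while assuming that $\mathbf{d}_j$, $j \neq k$, is fixed.
The penalty term in (15) can be written as

$$
\sum_{l=1}^{L} \sum_{i=1}^{n_l} \Big\| \mathbf{y}_i^{(l)} - \mathbf{R}_l^T \sum_{j=1}^{K} c_{i,j}^{(l)} \mathbf{d}_j \Big\|_2^2
= \sum_{l=1}^{L} \sum_{i=1}^{n_l} \big\| \mathbf{e}_{i,k}^{(l)} - c_{i,k}^{(l)} \mathbf{R}_l^T \mathbf{d}_k \big\|_2^2 \tag{16}
$$

where $n_l$ is the number of samples in the $l$th block, $c_{i,k}^{(l)}$ is the
$k$th element of $\mathbf{c}_i^{(l)} \in \mathbb{R}^K$, and
$\mathbf{e}_{i,k}^{(l)} \in \mathbb{R}^m$ is
the representation error for the compressive measurement $\mathbf{y}_i^{(l)}$
when the $k$th dictionary atom is removed. The objective function
in (16) is a quadratic function of $\mathbf{d}_k$, and the minimizer
is obtained by setting the derivative of the objective function
with respect to $\mathbf{d}_k$ equal to zero. First, let us define
$\mathcal{I}_k^{(l)}$ as a set of indices of compressive measurements in the
$l$th block using $\mathbf{d}_k$. Therefore, we get the following expression

$$
\mathbf{G}_k \mathbf{d}_k = \mathbf{b}_k, \qquad
\mathbf{G}_k \triangleq \sum_{l=1}^{L} s_k^{(l)} \mathbf{R}_l \mathbf{R}_l^T, \qquad
\mathbf{b}_k \triangleq \sum_{l=1}^{L} \sum_{i \in \mathcal{I}_k^{(l)}} c_{i,k}^{(l)} \mathbf{R}_l \mathbf{e}_{i,k}^{(l)} \tag{17}
$$

where $s_k^{(l)}$ is defined as the sum of squares of all the coefficients
related to the $k$th dictionary atom in the $l$th block,
$s_k^{(l)} \triangleq \sum_{i \in \mathcal{I}_k^{(l)}} (c_{i,k}^{(l)})^2$.

Note that $\mathbf{G}_k$ can be computed efficiently: concatenate
$\{\mathbf{R}_l\}_{l=1}^{L}$ in a matrix
$\mathbf{R} = [\mathbf{R}_1, \mathbf{R}_2, \ldots, \mathbf{R}_L] \in \mathbb{R}^{p \times (mL)}$
and define the diagonal matrix $\mathbf{S}_k$ as

$$
\mathbf{S}_k \triangleq \mathrm{diag}\big(
\underbrace{s_k^{(1)}, \ldots, s_k^{(1)}}_{\text{repeated } m \text{ times}}, \ldots,
\underbrace{s_k^{(L)}, \ldots, s_k^{(L)}}_{\text{repeated } m \text{ times}}
\big) \tag{18}
$$

where $\mathrm{diag}(\mathbf{z})$ represents a square diagonal matrix with the
elements of vector $\mathbf{z}$ on the main diagonal. Then, we have
$\mathbf{G}_k = \mathbf{R} \mathbf{S}_k \mathbf{R}^T$.

Given the updated $\mathbf{d}_k$, the optimal $c_{i,k}^{(l)}$ for all
$i \in \mathcal{I}_k^{(l)}$ is given by least squares as

$$
c_{i,k}^{(l)} = \frac{\langle \mathbf{e}_{i,k}^{(l)}, \mathbf{R}_l^T \mathbf{d}_k \rangle}{\|\mathbf{R}_l^T \mathbf{d}_k\|_2^2},
\quad \forall i \in \mathcal{I}_k^{(l)}.
$$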
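The identity $\mathbf{G}_k = \mathbf{R} \mathbf{S}_k \mathbf{R}^T$ can be verified numerically against the block-wise sum $\sum_l s_k^{(l)} \mathbf{R}_l \mathbf{R}_l^T$. The sketch below uses toy dimensions and random stand-ins for the weights $s_k^{(l)}$; all variable names are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(4)
p, m, L = 12, 3, 5

# One sparse-Bernoulli matrix per block (here s = 3).
R_blocks = [rng.choice([-1.0, 0.0, 1.0], size=(p, m), p=[1 / 6, 2 / 3, 1 / 6])
            for _ in range(L)]
s_k = rng.random(L)                         # stand-ins for the weights s_k^{(l)}

# Direct evaluation of the block-wise sum.
G_sum = sum(s * R @ R.T for s, R in zip(s_k, R_blocks))

# Concatenated form: scale the columns of R instead of building S_k explicitly.
R_cat = np.hstack(R_blocks)                 # p x (mL)
S_diag = np.repeat(s_k, m)                  # each weight repeated m times
G_fast = (R_cat * S_diag) @ R_cat.T
```

Because `R_cat` is fixed across iterations and atoms, only the vector `S_diag` changes from one update to the next; with a sparse matrix format the product would also exploit the zeros in `R_cat`.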

V. Experimental Results 


We examine the performance of our dictionary learning 
algorithm on a synthetic dataset. Our proposed method is 
compared with the fast and efficient implementation of K- 
SVD known as Approximate K-SVD (AK-SVD) [25] that
requires access to the entire data. We generate $K = 15$
dictionary atoms in $\mathbb{R}^p$, $p = 1000$, drawn from the uniform
distribution and normalized to have unit norm. A set of data 
samples {x^l^j 000 £ R p is generated where each sample 
Fig. 2. Results for synthetic data. Plot of successful recovery vs. (a)
iteration number, and (b) time. Our method CK-SVD for varying compression 
factor 7 is compared with AK-SVD. We observe that our method is both 
memory/computation efficient and accurate for 7 = g and 7 = 7^. 


is a linear combination of three distinct atoms, i.e. $T = 3$,
and the corresponding coefficients are chosen i.i.d. from the
Gaussian distribution $\mathcal{N}(0, 100)$. Then, each data sample is corrupted
by Gaussian noise drawn from $\mathcal{N}(\mathbf{0}, 0.04\,\mathbf{I}_{p \times p})$.

CK-SVD is applied on the set of compressive measurements 
obtained by very sparse random matrices for various values 
of the compression factor 7 = g, 75 > 5 ^, . We set the 

number of blocks $L = 250$ and $m/p = 0.1$. Performance
is evaluated by the magnitude of the inner product between
learned and true atoms. A value greater than 0.95 is counted
as a successful recovery. Fig. 2 shows the results of CK-SVD
averaged over 50 independent trials. In practice, when $T$ is
small, the updates for $\mathbf{d}_k$ are nearly decoupled, and we may
delay updating $c_{i,k}^{(l)}$ until after all $K$ updates of $\mathbf{d}_k$.
For $T = 3$, the accuracy results are indistinguishable.
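The recovery criterion can be sketched as follows. The text does not specify how learned atoms are paired with true atoms, so this example assumes that each true atom is matched to the learned atom with the largest absolute inner product; the function name `recovery_rate` and the test dictionaries are likewise hypothetical.

```python
import numpy as np


def recovery_rate(D_true, D_learned, threshold=0.95):
    """Fraction of true atoms matched by some learned atom above threshold."""
    A = D_true / np.linalg.norm(D_true, axis=0)
    B = D_learned / np.linalg.norm(D_learned, axis=0)
    # Assumption: best match over all learned atoms, ignoring sign.
    best = np.abs(A.T @ B).max(axis=1)
    return float(np.mean(best > threshold))


rng = np.random.default_rng(5)
D_true = rng.standard_normal((1000, 15))

# A permuted, sign-flipped copy counts as a full recovery.
perm = rng.permutation(15)
signs = rng.choice([-1.0, 1.0], size=15)
rate_same = recovery_rate(D_true, D_true[:, perm] * signs)

# An unrelated random dictionary recovers nothing in this dimension.
rate_rand = recovery_rate(D_true, rng.standard_normal((1000, 15)))
```

Taking the absolute value makes the metric invariant to the sign ambiguity inherent in dictionary learning, and the maximum over columns handles the permutation ambiguity.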

In Fig. 2, we see that our method is able to eventually reach
high accuracy even for 7=io’ achieving substantial savings 
in memory/data access. However, there is a tradeoff between 
memory and computation savings vs. accuracy. Our method 
is efficient in memory/computation and, at the same time, 
accurate for 7 = | and 7 = 75 , where it outperforms AK- 
SVD if the time of each iteration is factored in. We compare 
with AK-SVD to give an idea of our efficiency, but note that 
AK-SVD and our CK-SVD are not completely comparable. In 
our example, both methods reach 100 % accuracy eventually 
but in general they may give different levels of accuracy. The 
main advantage of CK-SVD appears as the dimensions grow, 
since then memory/data access is a dominant issue. 